Association between use of systematic reviews and national policy recommendations on screening newborn babies for rare diseases: systematic review and meta-analysis

Abstract
Objective To understand whether international differences in recommendations of whether to screen for rare diseases using the newborn blood spot test might in part be explained by use of systematic review methods.
Design Systematic review and meta-analysis.
Data sources Website searches of 26 national screening organisations.
Eligibility criteria for study selection Journal articles, papers, legal documents, presentations, conference abstracts, or reports relating to a national recommendation on whether to screen for any condition using the newborn blood spot test, with no restrictions on date or language.
Data extraction Two reviewers independently assessed whether the recommendation for or against screening included systematic reviews, and data on test accuracy, benefits of early detection, and potential harms of overdiagnosis.
Analysis The odds of recommending screening according to the use of systematic review methods were estimated across conditions using meta-analysis.
Results 93 reports were included that assessed 104 conditions across 14 countries, totalling 276 recommendations (units of analysis). Screening was favoured in 159 (58%) recommendations, not favoured in 98 (36%), and not recommended either way in 19 (7%). Only 60 (22%) of the recommendations included a systematic review. Use of a systematic review was associated with a reduced probability of screening being recommended (23/60 (38%) v 136/216 (63%), odds ratio 0.17, 95% confidence interval 0.07 to 0.43). Of the recommendations, evidence for test accuracy, benefits of early detection, and overdiagnosis was not considered in 115 (42%), 83 (30%), and 211 (76%), respectively.
Conclusions Using systematic review methods is associated with a reduced probability of screening being recommended. Many national policy reviews of screening for rare conditions using the newborn blood spot test do not assess the evidence on the key benefits and harms of screening.


Introduction
Worldwide, the conditions screened for by the newborn blood spot test vary widely, 1 2 with the number ranging from five to 60 on screening panels. 3 4 Effective screening programmes can save lives, whereas ineffective programmes can do more harm than good, for example, through overdiagnosis, the physical and psychological consequences of false positive test results, and opportunity costs for the healthcare system. It is not known whether the differences between countries result from genuine differences in disease prevalence or healthcare systems and priorities, or from differences in the evidence review process used to generate policy, 5 in particular the use of systematic reviews.
Since Wilson and Jungner produced their World Health Organization report on screening in 1968, there has been a divergence in the methods used internationally for policy making about screening. 6 In Denmark, Finland, France, Germany, Italy, the Netherlands, Sweden, the UK, Australia, and New Zealand, national and regional organisations have updated and amended the Wilson and Jungner principles to fit their local context and use their own versions to make policy recommendations and decisions about screening. 7 In the United States, the US Preventive Services Task Force has developed an analytical framework that is adapted to the particular circumstances of each review. 8 This includes three key elements that might determine the balance of benefits and harms from implementing screening for a condition: test accuracy for detecting the condition of interest; the benefit of early detection, and therefore treatment after screening compared with later detection following symptoms; and the extent of overdiagnosis, one of the main harms of screening owing to the detection of disease that would never have caused symptoms within someone's lifetime.
doi: 10.1136/bmj.k1612 | BMJ 2018;361:k1612 | the bmj
We analysed national policy making decisions about which conditions to screen for using the newborn blood spot test to determine whether systematic reviews were undertaken and if this was associated with the final recommendation of whether to implement screening. We also scored the extent to which each decision making process considered test accuracy, the benefit of early detection, and overdiagnosis, and investigated associations with the final decision.

Search
We searched the websites of national policy making organisations for all documentation related to the newborn blood spot test (see appendix 1 for organisations). A previous systematic review was used to identify these organisations. 7 We asked a panel of international screening experts to identify any further documentation, and we searched website databases of WHO, the European Council, the European Commission, and the European Observer. From the included documentation, we extracted and synthesised data describing the process of reaching decisions for every condition considered for inclusion on the newborn blood spot screening panel, with no restrictions on date or language.
The initial search for this review was conducted on the websites of these national organisations on 18 September 2015 using search terms for newborn blood spot screening and the conditions included by the American College of Medical Genetics (see appendix 2 for full search terms). We emailed each organisation and country experts requesting any further documentation on newborn blood spot screening. If either referred us to associated but different organisations, we searched those websites using the same search terms between 18 September 2015 and April 2016. For example, in the US we searched the US Preventive Services Task Force website and found that recommendations for the blood spot test are made by the Advisory Committee on Heritable Disorders in Newborns and Children. Similarly, after contacting the Ministry of Social Affairs and Health in Finland, we found that relevant reviews are on the Finnish HTA website. Overall, we searched the websites of 26 organisations.

Inclusion criteria
Two reviewers independently assessed each item against the inclusion criteria, with disagreements resolved by consensus. The inclusion criteria were:
Source of documents: only information from national policy making organisations was included. We excluded recommendations by state or regional organisations unless endorsed by a national policy making organisation, and recommendations by clinical societies or other groups unless they were explicitly used to underpin national policy decisions.
Type of document: we included all journal articles, papers, legal documents, presentations, conference abstracts, or reports from the website of the organisation and all those obtained through personal communication with policy makers, officials, and researchers in all included countries. We did not include patient information.
Language: there were no restrictions on language. For documents not in English we used automated translation software, with formal translation by native speakers if further clarity was needed.
Subject of documents: we included material on whether to start or stop screening or material that evaluated the effectiveness of current or proposed screening programmes for any condition using the newborn blood spot test. If we also found reviews of conditions for that country, we included documents describing standards for national evidence review processes for screening.
Method of reaching recommendation: we included recommendations produced using all methods, including evidence from systematic reviews, expert panels, or any approach that resulted in a recommendation or decision or described why or how a decision was made.

Data extraction
Two reviewers independently extracted data, with disagreements resolved by consensus and involvement of a third reviewer if necessary (see appendix 3 for data extraction sheet). Data extraction was carried out in two steps. Firstly, we recorded whether any of the review documentation included a systematic review. The criteria for defining a systematic review were inclusive; we required either two parts of the search strategy (for example, search terms, databases, dates) to be described or any details of systematic evidence selection after a search (for example, inclusion criteria, PRISMA flow chart) to be described (table 1). We were also inclusive about the question posed by the systematic review, which could address any aspect of the evidence relating to whether or not to screen for a condition, including benefits of early detection through screening, disease prevalence, test accuracy, effects of false positive test results, overdiagnosis or any other harm, and clinical course of the condition.
Criteria for defining a systematic review: A: describes two parts of the search strategy (eg, search terms, databases, dates), or B: describes any details of systematic evidence selection after a search (eg, inclusion or exclusion criteria, numbers at abstract and full text sift, PRISMA flow diagram). Each country was defined as having undertaken a systematic review for each condition if either criterion A or B, or both, were met.
The review topic could be about any aspect of screening for the disease under consideration (eg, benefits of early detection through screening, disease prevalence, test accuracy, effects of false positive test results, overdiagnosis or any other harm, clinical course).
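As a minimal sketch, the either-or rule above can be written as a small predicate. The function and argument names are hypothetical (they are not taken from the study's data extraction sheet), and "two parts" is read here as "at least two".

```python
def is_systematic_review(search_parts_described: int,
                         selection_details_described: bool) -> bool:
    """Apply the inclusive definition of a systematic review.

    Criterion A: at least two parts of the search strategy are described
    (for example, search terms, databases, dates).
    Criterion B: any details of systematic evidence selection are described
    (for example, inclusion criteria or a PRISMA flow diagram).
    Meeting either criterion, or both, is sufficient.
    """
    if search_parts_described < 0:
        raise ValueError("search_parts_described cannot be negative")
    criterion_a = search_parts_described >= 2
    criterion_b = selection_details_described
    return criterion_a or criterion_b


# Hypothetical inputs: three search details reported, no selection details.
print(is_systematic_review(3, False))  # criterion A alone is enough
```

Under this reading, a document that reports only one search detail and no selection details would not count as a systematic review.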
Secondly, we assessed three key elements characterising the main benefits and harms of screening: test accuracy, benefits of early detection through screening, and overdiagnosis. These characteristics were selected on the basis of our review of published frameworks for test evaluation 9-12 to identify all mechanisms recognised to affect patient health as a result of undergoing testing or taking part in a screening programme.
Table 2 details the scoring system for the assessment of evidence related to the three key elements. We measured whether and how the evidence was assessed; not what the evidence showed about that particular condition. A score of zero means that the element was not mentioned in the documentation, with increasing scores up to a score of 5 indicating greater and more systematic use of evidence and increasing assessment of internal and external validity. A score of ≥3 for any of the three key elements indicates that a systematic review was used for that recommendation. In some cases a systematic review was used and recorded as such but the review did not cover test accuracy, benefit of early detection, or overdiagnosis. In such cases, the evidence would score <3 for these three key elements in the secondary analyses but was still coded as a systematic review in the primary analysis (meta-analysis).
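The threshold rule in this paragraph can be illustrated with a short check. This is a sketch under stated assumptions: the names are hypothetical, scores are taken to be integers from 0 to 5, and the function covers only the three secondary scores, not the separate primary coding of whether a systematic review was used.

```python
def scores_indicate_systematic_review(test_accuracy: int,
                                      early_detection: int,
                                      overdiagnosis: int) -> bool:
    """Return True if any of the three key element scores is 3 or more.

    A False result does not rule out a systematic review in the primary
    analysis: a review may have addressed a topic outside these three
    elements (for example, disease prevalence).
    """
    scores = (test_accuracy, early_detection, overdiagnosis)
    for score in scores:
        if not 0 <= score <= 5:
            raise ValueError("scores are assumed to lie between 0 and 5")
    return any(score >= 3 for score in scores)


# Hypothetical recommendation: test accuracy reviewed systematically,
# overdiagnosis not mentioned at all.
print(scores_indicate_systematic_review(4, 2, 0))
```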
Test accuracy determines how many people are detected early with true positive test results and how many are potentially harmed by false positive results. The scoring system refers to whether there is an accurate test, which can include any test accuracy metrics such as sensitivity, specificity, and positive or negative predictive value. Consideration of the existence of a test is a necessary prerequisite but does not form part of the scoring system. The benefit of early detection leading to early treatment is the primary mechanism through which screening provides benefit. The scoring system refers specifically to the benefit of early treatment, not whether there is an effective treatment, which is also a prerequisite. Overdiagnosis in this context is defined as detection of disease at screening that would never have produced symptoms within someone's lifetime.
We were inclusive in the language used to describe overdiagnosis, including asymptomatic phenotypes, penetrance, and any description of people remaining symptom-free to adulthood.

Statistical analysis
Cohen's κ was used to calculate inter-reviewer reliability for judgments of whether a systematic review was used, scores for the test accuracy, benefits of early detection, and overdiagnosis, and whether screening was recommended, with linear weighting when more than two categories existed, and interpretation according to Landis and Koch. 25 We report proportions of included decisions that used systematic review methods; the methods used to assess test accuracy, benefit of early detection, and overdiagnosis (graphs show distribution of scores); and the final recommendation tabulated by country. To determine whether the patterns observed were purely historical we repeated the analysis including only policies since 2012.
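For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Cohen's κ with linear weights, together with the Landis and Koch bands. It is illustrative only: the study's analyses used Stata, the function names are hypothetical, and the example ratings are invented rather than taken from the review.

```python
def linear_weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's kappa with linear disagreement weights.

    `categories` lists the ordered categories. With two categories the
    result equals unweighted Cohen's kappa.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {category: i for i, category in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions for every pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Linear weight: disagreement grows with category distance.
        return abs(i - j) / (k - 1)

    observed_disagreement = sum(
        weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    expected_disagreement = sum(
        weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1.0 - observed_disagreement / expected_disagreement


def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch descriptive bands."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Invented example: two reviewers judging whether a systematic review was used.
reviewer_1 = [1, 1, 0, 0]
reviewer_2 = [1, 0, 0, 0]
kappa = linear_weighted_kappa(reviewer_1, reviewer_2, categories=[0, 1])
print(kappa, landis_koch_label(kappa))
```

For the 0 to 5 element scores, the same function would be called with `categories=[0, 1, 2, 3, 4, 5]`, so that a disagreement of one point is penalised less than a disagreement of several points.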
Examples of applying the criteria for defining a systematic review (country | condition | systematic review | description in documentation | rationale):
"[we] assessed the conditions selected for additional analysis, which was based on a review of original literature including treatment options, screening potential and experience." 82 | No further details of the review process were provided
Canada | Phenylketonuria | Yes | Section 17 outlined the review methods, and included: source searched (Medline only), search term (phenylketonuria), and date limit. 95 | Meets criterion for describing two parts of the search strategy
UK | Long chain 3-hydroxyacyl-CoA dehydrogenase deficiency | Yes | "Chapter 5 provides a methodology for the systematic review." This included the search strategy, resources searched (electronic databases and reference lists of identified articles), search terms, date limit, language restrictions, and number of reviewers; and the inclusion and exclusion criteria. 22 | Meets both criteria for defining a systematic review because at least two parts of the search strategy and inclusion criteria were described

We computed the odds ratio for recommending screening for each condition if a systematic review was used compared with recommending screening if a systematic review was not used. To get an overall estimate of the impact of using systematic reviews on policy formation of recommendations, we meta-analysed odds ratios across conditions. This stratified approach removes the confounding effect of clinical condition. Only conditions where there were discrepancies in recommendations (ie, at least one recommendation for and one recommendation against screening) and in methods (ie, at least one recommendation with systematic review evidence and one without) could contribute to this comparison and were included in the meta-analysis. We calculated an overall effect estimate using Mantel-Haenszel fixed effects meta-analysis with a 0.1 zero cell correction. 26 27 The analyses were repeated with no and other values of zero cell correction (0.5, 0.01, 0.001), using the DerSimonian and Laird random effects method with zero cell correction 0.5, and the Peto method. 27 We tested for heterogeneity using Cochran's Q and described its magnitude using the I² statistic. All analyses used Stata version 13. Spearman correlation was used to univariately assess the relation between policy recommendations and the rigor of methods used to assess test accuracy, the benefits of early detection and treatment, and the risks of overdiagnosis (only systematic reviews of conditions for which there were recommendations both for and against screening were included in this analysis).
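The stratified pooling described above can be sketched in a few lines. This is not the study's Stata code: the function name and the 2x2 tables are hypothetical, and the handling of zero cells (adding the correction to every cell of any table that contains a zero) is one common convention assumed here rather than a detail confirmed by the text.

```python
def mantel_haenszel_or(tables, zero_cell_correction=0.1):
    """Pooled Mantel-Haenszel odds ratio across strata (conditions).

    Each table is (a, b, c, d) for one condition:
      a = systematic review used, screening recommended
      b = systematic review used, screening not recommended
      c = no systematic review, screening recommended
      d = no systematic review, screening not recommended
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        if 0 in (a, b, c, d):
            # Assumed convention: correct all four cells of the table.
            a, b, c, d = (x + zero_cell_correction for x in (a, b, c, d))
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these tables")
    return numerator / denominator


# Invented counts for three conditions; not data from the review.
hypothetical_tables = [
    (1, 3, 3, 1),
    (0, 2, 4, 1),
    (1, 1, 5, 0),
]
print(mantel_haenszel_or(hypothetical_tables))
```

Because each condition contributes its own table, a condition in which every recommendation agrees, or in which every recommendation used the same method, carries no information about the comparison, which is why such conditions were left out of the meta-analysis.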

Patient involvement
No patients were involved in setting the research question or the outcome measures, nor were they involved in developing plans for design or implementation of the study. No patients were asked to advise on interpretation or writing up of results. We will work with patients and members of the public to help disseminate findings to appropriate audiences.

Description of evidence
We identified 134 policy documents (fig 1), 108 of which were from screening organisation websites and 26 referred from experts. Overall, 41 documents were excluded. Reasons for exclusion were: description of current screening practice, policy, or laws; list of conditions included or considered for inclusion in programme; document stating decision to change programme; document not from national organisation; duplication of included information; patient information; description of organisation or study; no investigation of an included condition; contracts; and not newborn blood spot test (see appendix 4 for references of exclusions with reasons). After exclusions, 93 reports remained. Two covered Australia and New Zealand together, 30 33 two were from Australia, 61 87 four from Belgium, 24 68 69 105 three from Canada, 19 37 95 two from Denmark, 17 82 three from Finland, 31 59 85 eight from France, 20 34 35 71 72 99 100 104 three from Germany, 106-108 one from Italy, 102 four from Japan, 53 66-88 four from the Netherlands, 14 79-81 two from New Zealand, 15 16 24 from Spain, 18

Review methods used
Overall, the 93 reports included 104 conditions from 14 countries, giving a total of 276 recommendations (units of analysis). Cohen's κ for inter-reviewer reliability was 0.91 (near perfect) for whether a systematic review was used, 0.73 (substantial) for test accuracy score, 0.47 (moderate) for benefit of early detection score, 0.62 (substantial) for overdiagnosis score, and 0.97 (near perfect) for the final recommendation of each review.
Of the 276 recommendations, 159 (58%) were in favour of screening, 98 (36%) were against screening, and no suggestion was made either way in 19 (7%). Sixty (22%) of the recommendations included evidence from a systematic review. Of the recommendations, evidence for test accuracy, benefits of early detection, and overdiagnosis was not considered in 115 (42%), 83 (30%), and 211 (76%), respectively. Of the 60 recommendations that employed systematic review methods, 21 systematic reviews covered test accuracy, benefits of early detection, and overdiagnosis. Figure 2 shows the full distribution of scores. Similar patterns are observed if only the most recent 154 reviews (from 2012 onwards) are included (see supplemental figure 1). Table 3 shows a full breakdown by country.

Association between evidence review methods and recommendations
Of the 60 decisions that included a systematic review, 23 (38%) recommended screening, 29 (48%) recommended not to screen, and eight (13%) made no recommendation either way. The corresponding results for the 216 decisions not based on evidence from a systematic review were 136 (63%), 69 (32%), and 11 (5%), respectively.
Fig 1 | Flow of documents through study. One paper was included from Italy, but no national decisions in the analysis, because one paper that will be used in part to underpin the national decisions has been published, but the national review process is incomplete and recommendations are yet to be made
The meta-analysis included 24 conditions, each with between two and eight reviews, with 104 reviews in total. The odds of making a decision to recommend screening were lower when a systematic review was used than when no systematic review was used (odds ratio 0.17, 95% confidence interval 0.07 to 0.43, P<0.001; fig 3). Owing to the small sample sizes, little heterogeneity existed between conditions (χ²=12.45 (df=23), P=0.96), with none of the total variance due to variability between conditions (I²=0%). Sensitivity analyses using different zero cell corrections and meta-analysis methods did not alter the results and were all highly significant (P<0.001), although increasing the zero cell correction did slightly reduce the effect size (see appendix 2).
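The reported I² of 0% follows directly from the heterogeneity statistic and its degrees of freedom. The sketch below assumes the conventional definition, I² = max(0, (Q − df)/Q) × 100%, and uses a hypothetical function name.

```python
def i_squared(q: float, df: int) -> float:
    """I-squared (%) from Cochran's Q and its degrees of freedom.

    Negative values of (Q - df) / Q are truncated to zero, so Q below its
    degrees of freedom always gives 0%.
    """
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Values reported for the meta-analysis: Q = 12.45 on 23 degrees of freedom.
print(i_squared(12.45, 23))
```

Because Q (12.45) is smaller than its degrees of freedom (23), the untruncated ratio is negative and I² is set to 0%.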
Review scores for benefits of early detection and overdiagnosis were not statistically significantly correlated with the recommendation of the review, although there was an association between greater consideration of test accuracy in the review and a recommendation against screening (table 4). Confidence intervals were wide, narrowly excluding zero for test accuracy and just overlapping zero for overdiagnosis score.

Discussion
We assessed whether use of a systematic review affects national decisions on whether to screen for a range of conditions using the newborn blood spot test. After full text review, we included 93 reports assessing a total of 104 conditions across 14 countries, with 276 recommendations. Only 22% of the recommendations were based on evidence from a systematic review. The odds of a decision in favour of screening were lower when a systematic review was used as part of the policy decision (odds ratio 0.17, 95% confidence interval 0.07 to 0.43). The evidence on accuracy of the test was not evaluated in 42% of recommendations. Similarly, the evidence around the benefits of early detection and the potential harm of overdiagnosis was not evaluated in 30% and 76% of reviews, respectively. These elements were not mentioned at all in the review documents, which suggests either lack of evidence review or lack of consideration. For each review, the more thoroughly test accuracy was considered, the lower the probability that screening would be recommended. A weak association in the same direction was found between screening recommendations and the thoroughness with which both early treatment benefits and overdiagnosis were assessed. However, power was too limited to assess these associations, owing to the low scores creating a floor effect.

Strengths and limitations of this study
The strengths of this study include the large number of documents extracted using systematic methods, with no restrictions on date, language, or country, and the use of meta-analytical methods to determine whether there was a consistent effect across different conditions thus accounting for confounding by condition. Also we used automated translation software, which enabled broader inclusion criteria, although errors might have occurred in translation. To mitigate this risk, we used formal translation for documents or parts of documents where the automated translation was unclear to reviewers. In addition, the review of grey literature documenting national policy decisions is challenging in itself, particularly on reproducibility since websites change over time. We also contacted every organisation for further documents, but it is possible that more systematic reviews were used than were published or referenced by the national websites of policy makers or identified through personal communication.
Although we found an association between use of systematic reviews and whether or not a screening programme was recommended, the decision on whether to undertake a systematic review might have been driven by country level factors, as four of the 14 included countries always used a systematic review and four never did. Thus it might be possible that use of systematic review methods acted as a proxy for unmeasured country level confounders, so only tentative conclusions can be drawn.

Comparison with other studies
Previous research has highlighted an underuse of systematic reviews in developing policy guidance for screening programmes. A 2006 study reported that systematic reviews were rarely used in production of WHO guidance, a discovery that initiated a major research effort to incorporate greater use 109 Although the research literature concerning measurement of overdiagnosis is extensive, our study systematically investigated whether consideration of potential overdiagnosis is incorporated into national screening policy decision making. Our main finding, however, was that policy reports that did not utilise systematic review methods were more likely to recommend screening, suggesting that rigorous appraisal exposes the absence or unreliability of available evidence. Indeed, several studies have shown differences between expert opinion and research evidence. One study observed that professional recommendations on treatments for acute myocardial infarction communicated through review articles or textbooks often contradicted the best evidence from meta-analysis of trials available at the time of publication. 110 An opinion article argued that experts are more likely to overestimate the effectiveness of interventions based on their own clinical experiences. 111 In fact a systematic review showed that clinicians overestimate the benefits of screening and underestimate the harms. 112 We consider that quality appraisal in systematic reviewing serves as a mechanism to highlight bias in research studies (often biased away from the null). This might explain why expert policy making groups that use systematic reviews are less likely to recommend screening.

Policy implications
This study showed that many national policy decisions about whether to screen for conditions using the newborn blood spot test are being made without systematically reviewing the evidence. One reason for this absence is likely to lie in the absence of evidence from randomised controlled trials, which is unavailable for most conditions included in the newborn blood spot test owing to their rarity. Indeed, although many countries have developed robust systems for reviewing new screening programmes, we found that they are often not applied when assessing whether to screen for additional rare diseases using the newborn blood spot test. Yet it remains essential to make evidence based policy decisions because once screening programmes are started they are difficult to stop. 12 When trial evidence is not available, a review of whether to screen for each condition should consider the evidence for each pathway to patient benefit and harm resulting from introducing a screening test, in particular: the test's ability to discern true disease, any resulting potential for patient harm from overdiagnosis, and the benefits of early detection. Although many reviews considered whether subsequent diagnostic tests and treatments were available to manage screened patients, most did not consider evidence for the screening test's accuracy, nor whether earlier detection and treatment after screening were beneficial to patients compared with later detection of symptoms and treatment. These three elements are not an exhaustive list of benefits and harms (for example, we did not examine the effect of screening results on other family members); however, there is broad agreement that they are key indicators of effectiveness. 10 11 We recommend that whenever possible a systematic review of the literature should be undertaken as part of policy decisions on whether to commence screening.
Full systematic reviews that assess each key element of a screening programme can be expensive and time consuming, particularly in the absence of trial evidence, and we propose more international collaboration to undertake such reviews. Although the health systems, prevalence, culture, and willingness to pay thresholds might differ by country, the evidence about test accuracy, benefits of early detection, and overdiagnosis forms international bodies of evidence, and collating them will be the same regardless of country. Only concerns about applicability will differ.

Conclusions
Further research is required to understand why policy makers do not employ systematic review methods in their evaluations of evidence. Possible reasons include costs, time, and knowledge and beliefs about systematic reviews. 113 Undertaking international reviews for conditions across several countries would reduce overall costs. These reviews could be adapted to local populations and prevalence and improve rigour while reducing discrepancies in screening internationally.
Contributors: ST-P designed the study, was first reviewer, undertook the analysis, drafted the manuscript, and is the guarantor. CS and LFdR were second reviewers. FS ran the searches. AC assisted with study design and write up. JD contributed to study design and planned the statistical analysis. All authors contributed to the write up and approved the final version.